Extending A Thesaurus By Classifying Words
Abstract
This paper proposes a method for extending an existing thesaurus by classifying new words in terms of that thesaurus. New words are classified on the basis of the relative probabilities of a word belonging to a given word class, with the probabilities calculated using noun-verb co-occurrence pairs. Experiments using the Japanese Bunruigoihyō thesaurus on about 420,000 co-occurrences showed that new words can be classified correctly with a maximum accuracy of more than 80%.

1 Introduction

For most natural language processing (NLP) systems, thesauri comprise indispensable linguistic knowledge. Roget's International Thesaurus [Chapman, 1984] and WordNet [Miller et al., 1993] are typical English thesauri which have been widely used in past NLP research [Resnik, 1992; Yarowsky, 1992]. They are handcrafted, machine-readable and have fairly broad coverage. However, since these thesauri were originally compiled for human use, they are not always suitable for computer-based natural language processing. Limitations of handcrafted thesauri can be summarized as follows [Hatzivassiloglou and McKeown, 1993; Uramoto, 1996; Hindle, 1990].

• limited vocabulary size
• unclear classification criteria
• building thesauri by hand requires considerable time and effort

The vocabulary size of typical handcrafted thesauri ranges from 50,000 to 100,000 words, including general words in broad domains. From the viewpoint of NLP systems dealing with a particular domain, however, these thesauri include many unnecessary (general) words and do not include necessary domain-specific words.

The second problem with handcrafted thesauri is that their classification is based on the intuition of lexicographers, and the classification criteria are not always clear. For the purposes of NLP systems, their classification of words is sometimes too coarse to distinguish sufficiently between words, and sometimes unnecessarily detailed.
Lastly, building thesauri by hand requires significant amounts of time and effort even for restricted domains. Furthermore, this effort is repeated when a system is ported to another domain.

This criticism leads us to automatic approaches for building thesauri from large corpora [Hirschman et al., 1975; Hindle, 1990; Hatzivassiloglou and McKeown, 1993; Pereira et al., 1993; Tokunaga et al., 1995; Ushioda, 1996]. Past attempts have basically taken the following steps [Charniak, 1993].

(1) extract word co-occurrences
(2) define similarities (distances) between words on the basis of co-occurrences
(3) cluster words on the basis of similarities

The most crucial part of this approach is gathering word co-occurrence data. Co-occurrences are usually gathered on the basis of certain relations such as predicate-argument, modifier-modified, adjacency, or a mixture of these. However, it is very difficult to gather sufficient co-occurrences to calculate similarities reliably [Resnik, 1992; Basili et al., 1992]. It is sometimes impractical to build a large thesaurus from scratch based only on co-occurrence data.

Based on this observation, a third approach has been proposed, namely, combining linguistic knowledge and co-occurrence data [Resnik, 1992; Uramoto, 1996]. This approach aims at compensating for the sparseness of co-occurrence data by using existing linguistic knowledge, such as WordNet. This paper follows this line of research and proposes a method to extend an existing thesaurus by classifying new words in terms of that thesaurus. In other words, the proposed method identifies appropriate
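
The idea of classifying a new word by the relative probabilities of its belonging to each thesaurus class, using noun-verb co-occurrence pairs, can be illustrated with a small sketch. This excerpt does not give the paper's actual probability model, so everything below is an assumption made for illustration: the function names `train_class_profiles` and `classify`, the toy data, the uniform class prior, the add-alpha smoothing, and the multinomial (naive-Bayes-style) scoring.

```python
import math
from collections import Counter, defaultdict


def train_class_profiles(pairs, thesaurus):
    """Aggregate verb counts per thesaurus class from (noun, verb) pairs."""
    profiles = defaultdict(Counter)
    for noun, verb in pairs:
        cls = thesaurus.get(noun)
        if cls is not None:  # nouns missing from the thesaurus are skipped
            profiles[cls][verb] += 1
    return dict(profiles)


def classify(new_word_verbs, profiles, alpha=1.0):
    """Return relative class probabilities for a new word.

    new_word_verbs: verbs observed co-occurring with the new word.
    alpha: add-alpha smoothing constant (an assumption of this sketch).
    A uniform prior over classes is assumed.
    """
    vocab = {v for counts in profiles.values() for v in counts}
    vocab |= set(new_word_verbs)
    log_scores = {}
    for cls, counts in profiles.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_scores[cls] = sum(
            math.log((counts[v] + alpha) / total) for v in new_word_verbs
        )
    # Normalise in log space so the scores become relative probabilities.
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


# Toy data, invented for illustration only.
pairs = [
    ("dog", "bark"), ("dog", "run"), ("cat", "run"), ("cat", "sleep"),
    ("car", "drive"), ("bus", "drive"), ("bus", "stop"),
]
thesaurus = {"dog": "animal", "cat": "animal", "car": "vehicle", "bus": "vehicle"}

profiles = train_class_profiles(pairs, thesaurus)
probs = classify(["run", "sleep"], profiles)
best = max(probs, key=probs.get)
print(best, probs)
```

In this sketch the existing thesaurus supplies the classes, which is how linguistic knowledge offsets sparse co-occurrence data: a new word only needs enough verb co-occurrences to be compared against class-level profiles, not against every individual word.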
Similar resources
Extending a Thesaurus with Words from Pan-Chinese Sources
In this paper, we work on extending a Chinese thesaurus with words distinctly used in various Chinese communities. The acquisition and classification of such region-specific lexical items is an important step toward the larger goal of constructing a Pan-Chinese lexical resource. In particular, we extend a previous study in three respects: (1) to improve automatic classification by removing dupl...
Extending a Thesaurus in the Pan-Chinese Context
In this paper, we address a unique problem in Chinese language processing and report on our study on extending a Chinese thesaurus with region-specific words, mostly from the financial domain, from various Chinese speech communities. With the larger goal of automatically constructing a Pan-Chinese lexical resource, this work aims at taking an existing semantic classificatory structure as levera...
Classifying Adverbs based on an Existing Thesaurus using Corpus
In this paper, we attempt to classify adverbs. Methods for classifying adverbs are less well established than those for nouns or verbs. The difficulty comes from the vagueness of the relation between adverbs and other words. Using a corpus and an existing thesaurus, we investigate the elements which play an important role in the classification of adverbs. We got experimental results which showed t...
Word classification based on combined measures of distributional and semantic similarity
The paper addresses the problem of automatic enrichment of a thesaurus by classifying new words into its classes. The proposed classification method makes use of both the distributional data about a new word and the strength of the semantic relatedness of its target class to other likely candidate classes.
A Method for Keyword Extraction and Word Weighting to Improve Persian Text Classification
Due to the ever-increasing expansion of information and the huge amount of existing unstructured documents, keywords play a very important role in information retrieval. Because manual extraction of keywords faces various challenges, automated extraction seems inevitable. This research attempts to use a thesaurus (a structured word net) to extract them automatically. A...
Publication date: 1997